Pacific Association for Computational Linguistics
VECTOR SPACE MODEL BASED ON SEMANTIC ATTRIBUTES OF WORDS
Abstract
To reduce the dimension of the Vector Space Model (VSM) for information retrieval and clustering, this paper proposes a new method, Semantic-VSM, which uses the Semantic Attribute System defined by "A-Japanese-Lexicon" in place of the literal words used in conventional VSM. The attribute system is a tree of 2,710 attributes that covers 400 thousand literal words. With this system, vector elements can be generalized easily along the upper-lower relationships of semantic attributes, so the dimension can be reduced at very low cost. Synonyms are automatically accounted for through semantic attributes, which improves the recall of retrieval systems. Experiments on the BMIR-J2 database of 5,079 newspaper articles showed that the dimension can be reduced from 2,710 to 300 or 600 with only a small degradation in performance. The method also showed high recall compared with conventional VSM.
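The generalization step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy attribute tree `PARENT`, the word-to-attribute lexicon `WORD_ATTR`, and the rule of cutting the tree at a fixed depth are hypothetical and are not taken from "A-Japanese-Lexicon" or from the paper's actual procedure. The sketch only shows how mapping words to attributes merges synonyms into one vector element, and how moving attributes up to their ancestors shrinks the vector.

```python
from collections import Counter

# Hypothetical toy attribute tree: child -> parent (None marks the root).
# The real system has 2,710 attributes; these six are invented.
PARENT = {
    "noun": None,
    "concrete": "noun",
    "abstract": "noun",
    "animal": "concrete",
    "plant": "concrete",
    "emotion": "abstract",
}

# Hypothetical lexicon: each literal word maps to one semantic attribute.
WORD_ATTR = {
    "dog": "animal",
    "hound": "animal",
    "rose": "plant",
    "joy": "emotion",
}


def depth(attr):
    """Number of edges between an attribute and the root."""
    d = 0
    while PARENT[attr] is not None:
        attr = PARENT[attr]
        d += 1
    return d


def generalize(attr, max_depth):
    """Replace an attribute by its ancestor at depth max_depth (assumed rule)."""
    while depth(attr) > max_depth:
        attr = PARENT[attr]
    return attr


def semantic_vector(words, max_depth):
    """Count attribute occurrences after generalization; unknown words are skipped."""
    vec = Counter()
    for word in words:
        attr = WORD_ATTR.get(word)
        if attr is not None:
            vec[generalize(attr, max_depth)] += 1
    return vec


doc = ["dog", "hound", "rose", "joy"]
fine = semantic_vector(doc, max_depth=2)    # leaf-level attributes
coarse = semantic_vector(doc, max_depth=1)  # one level more general
```

In this toy setting, "dog" and "hound" fall into the same element at every level, which is the synonym effect the abstract mentions, and lowering `max_depth` reduces the number of distinct elements from three to two.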
Similar resources
Vector Space Modelling of Natural Language
Vector space modelling of words has been a focus of research in the field of computational linguistics over the past decade. The aim of Vector Space Models is to project the words in a text corpus onto a vector space such that semantically similar words lie close together in that space (also called a semantic space). Recently, the research focus has shifted towards semantic compositionali...
Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations
The main task of tokenization is to divide the sentences of a text into their constituent units and remove punctuation marks (dots, commas, etc.). Each unit is a continuous lexical or grammatical written chain that forms an independent semantic unit. Tokenization occurs at the word level, and the extracted units can be used as input to other components such as a stemmer. The requirement to create...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
A Large-scale Lexical Semantic Knowledge-base of Chinese
The Semantic Knowledge-base of Contemporary Chinese (SKCC) is a large scale Chinese semantic resource developed by the Institute of Computational Linguistics of Peking University. It provides a large amount of semantic information such as semantic hierarchy and collocation features for 66,539 Chinese words and their English counterparts. Its POS and semantic classification represent the latest ...
Historical semantics attributes of Prophet Ibrahim (AS) in the Holy Quran (Case Study Hanif words, certainly, Avah, Aime, model)
So how journey of Prophet Ibrahim (AS) is a specific manifestation in the Qur'an. So far, research-lot about some of them have been completed. Surveys show that about Hanif etymology approach because it signified change, have been in vain. In certain cases, the root of it and in the case of the Islamic Ummah to the confluence of two words "my nation" and connect the two, based on the abstract m...